Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++

Authors

  • Timo Bingmann
  • Michael Axtmann
  • Emanuel Jöbstl
  • Sebastian Lamm
  • Huyen Chau Nguyen
  • Alexander Noe
  • Sebastian Schlag
  • Matthias Stumpp
  • Tobias Sturm
  • Peter Sanders
Abstract

We present the design and a first performance evaluation of Thrill, a prototype of a general-purpose big data processing framework with a convenient data-flow style programming interface. Thrill is somewhat similar to Apache Spark and Apache Flink, with at least two main differences. First, Thrill is based on C++, which enables performance advantages due to direct native code compilation, a more cache-friendly memory layout, and explicit memory management. In particular, Thrill uses template meta-programming to compile chains of subsequent local operations into a single binary routine without intermediate buffering and with minimal indirections. Second, Thrill uses arrays rather than multisets as its primary data structure, which enables additional operations like sorting, prefix sums, window scans, or combining corresponding fields of several arrays (zipping). We compare Thrill with Apache Spark and Apache Flink using five kernels from the HiBench suite. Thrill is consistently faster and often several times faster than the other frameworks. At the same time, the source code has a similar level of simplicity and abstraction.

Keywords: C++; big data tool; distributed data processing
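The two ideas named in the abstract can be illustrated with a small single-machine sketch in plain C++. It is not Thrill's API: `Compose`, `MapFused`, `PrefixSum`, `Zip`, and `WindowSum` are hypothetical names chosen for this example, and `std::vector` stands in for a distributed array. The first part shows how passing callables as template parameters lets a chain of local operations run as one loop without an intermediate buffer. The second part shows operations that only make sense on ordered arrays, not on multisets.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper (not Thrill's API): fuse two callables into one.
// Both are template parameters, so the compiler can inline the chain.
template <typename F, typename G>
auto Compose(F f, G g) {
    return [f, g](auto&& x) { return g(f(std::forward<decltype(x)>(x))); };
}

// Apply a (possibly composed) function in a single pass over the input.
// No intermediate vector is created between the chained steps.
template <typename T, typename Fn>
auto MapFused(const std::vector<T>& in, Fn fn) {
    using Out = decltype(fn(std::declval<const T&>()));
    std::vector<Out> out;
    out.reserve(in.size());
    for (const T& x : in) out.push_back(fn(x));
    return out;
}

// Inclusive prefix sum: the result depends on element order.
std::vector<long> PrefixSum(const std::vector<long>& in) {
    std::vector<long> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}

// Zip: combine the elements that share an index in two arrays.
// Assumption of this sketch: extra elements of the longer array are ignored.
template <typename A, typename B, typename Fn>
auto Zip(const std::vector<A>& a, const std::vector<B>& b, Fn fn) {
    using Out = decltype(fn(a[0], b[0]));
    const std::size_t n = std::min(a.size(), b.size());
    std::vector<Out> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) out.push_back(fn(a[i], b[i]));
    return out;
}

// Window scan: sum of every k consecutive elements, using a running sum.
std::vector<long> WindowSum(const std::vector<long>& in, std::size_t k) {
    std::vector<long> out;
    if (k == 0 || in.size() < k) return out;
    long sum = std::accumulate(
        in.begin(), in.begin() + static_cast<std::ptrdiff_t>(k), 0L);
    out.push_back(sum);
    for (std::size_t i = k; i < in.size(); ++i) {
        sum += in[i] - in[i - k];  // slide the window one position right
        out.push_back(sum);
    }
    return out;
}
```

For instance, composing "add one" with "double" and passing the result to `MapFused` on the input 1, 2, 3 yields 4, 6, 8 in a single loop. A real framework would additionally partition the array across workers and exchange partial results for the prefix sum and window operations, which this sketch leaves out.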

Similar resources

Scalable Construction of Text Indexes

The suffix array is the key to efficient solutions for myriad string processing problems in different application domains, such as data compression, data mining, or bioinformatics. With the rapid growth of available data, suffix array construction algorithms had to be adapted to advanced computational models such as external memory and distributed computing. In this article, we present five s...

Learning and modeling big data

Caused by powerful sensors, advanced digitalisation techniques, and dramatically increased storage capabilities, big data in the sense of large or streaming data sets, very high dimensionality, or complex data formats constitutes one of the major challenges faced by machine learning today. In this realm, a couple of typical assumptions of machine learning can no longer be met, such as the p...

Distributed Scientific Data Processing Using the DBC

The Distributed Batch Controller (DBC) supports scientific batch data processing. The DBC distributes batch jobs to one or more pools of workstations and monitors and controls their execution. The pools themselves may be geographically distributed and need not be dedicated to processing batch jobs. We describe the use of the DBC in a large scientific data processing application, namely the generation...

Tensor-Based Backpropagation in Neural Networks with Non-Sequential Input

Neural networks have been able to achieve groundbreaking accuracy at tasks conventionally considered only doable by humans. Using stochastic gradient descent, optimization in many dimensions is made possible, albeit at a relatively high computational cost. By splitting training data into batches, networks can be distributed and trained vastly more efficiently and with minimal accuracy loss. We ...

Scientific Data Processing Using the DBC

The Distributed Batch Controller (DBC) supports scientific batch data processing. The DBC distributes batch jobs to one or more pools of workstations and monitors and controls their execution. The pools themselves may be geographically distributed, and need not be dedicated to processing batch jobs. We describe the use of the DBC in a large scientific data processing application, namely the gener...

Publication date: 2016